Familiarize with data

Objective
To analyze the chemical properties of red wine and determine which factors might be responsible for good-quality red wine.

Introduction
Looking at the variables in the dataset, there are some really interesting questions that can be answered.
- Does a high alcohol content increase the quality of red wine?
- Does a high sugar content make red wine tastier and hence result in a higher-quality product?

Let’s explore the data to answer the above questions and to understand the data visually. We will plot graphs, identify outliers and draw inferences about the data from the various plots (histograms, scatterplots, bar plots, etc.).

Univariate Analysis

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
# Convert quality to factor
wine$quality.as_factor <- factor(wine$quality)

# Prepare data (quality) for categorization
levels(wine$quality.as_factor)
## [1] "3" "4" "5" "6" "7" "8"

Categorizing distributions of the Variables

Normal              Non-normal
alcohol             citric.acid
fixed.acidity       residual.sugar
volatile.acidity    free.sulfur.dioxide
density             total.sulfur.dioxide
pH                  sulphates
-                   chlorides

So we need to transform the variables whose distributions do not look normal or close to normal. A log10 transformation can be used for this.
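The log10 transformation mentioned above can be sketched as follows. This is only an illustration: the data frame `demo` holds simulated values standing in for a right-skewed variable such as residual.sugar, and `skewness` is a hypothetical helper written here, not a function from the original analysis.

```r
# Simulated stand-in for a right-skewed variable such as residual.sugar;
# these values are NOT taken from the wine dataset.
set.seed(42)
demo <- data.frame(residual.sugar = rlnorm(500, meanlog = 0.8, sdlog = 0.5))

# Simple moment-based skewness (illustrative helper, not from a package).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# log10 is only defined for positive values; a variable containing zeros
# would need a different treatment, for example log10(x + 1).
demo$residual.sugar.log10 <- log10(demo$residual.sugar)

skew_raw <- skewness(demo$residual.sugar)
skew_log <- skewness(demo$residual.sugar.log10)
c(raw = skew_raw, log10 = skew_log)
```

A skewness closer to zero after the transformation indicates a more symmetric, closer-to-normal distribution.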

Let’s further classify the quality of wine into three ordered categories: low (3, 4), medium (5, 6) and best (7, 8).

low <- wine$quality <= 4
medium <- wine$quality > 4 & wine$quality < 7
best <- wine$quality > 6
wine$quality.category <- factor(ifelse(low, 'low', 
                                ifelse(medium, 'medium', 'best')), 
                                levels = c("low", "medium", "best"))
levels(wine$quality.category)
## [1] "low"    "medium" "best"
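The same three-way grouping could also be expressed with base R's `cut()`, which may be easier to extend if the boundaries change. This is an alternative sketch, not the code used in this analysis; the `quality` vector below is a hypothetical example rather than the wine data.

```r
# Hypothetical quality scores covering the same 3-8 range as the dataset.
quality <- c(3, 4, 5, 6, 7, 8, 5, 6)

# cut() uses right-closed intervals by default: (2,4], (4,6], (6,8].
quality.category <- cut(quality,
                        breaks = c(2, 4, 6, 8),
                        labels = c("low", "medium", "best"),
                        ordered_result = TRUE)

table(quality.category)
```

Setting `ordered_result = TRUE` returns an ordered factor, which matches the idea that low < medium < best.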

Bivariate Analysis

We can use the ggpairs function to find the level of correlation between the variables, so that those with very little relevance can be skipped.

Comparing each of the attributes/compounds of wine with quality and coming to conclusions

Let’s draw plots of all variables against the quality of red wine and analyze which variables have a relation (positive, negative or none) with quality.
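A numeric companion to such plots is the median of each variable per quality level: a steadily rising or falling median hints at a positive or negative relation. The sketch below assumes a data frame `demo` with simulated values, so the trends it shows are built in by construction and are not findings about the wine data.

```r
# Simulated stand-in data: 40 samples per quality level (3 to 8).
# The upward and downward trends are built in purely for illustration.
set.seed(7)
demo <- data.frame(quality = rep(3:8, each = 40))
demo$alcohol <- 9 + 0.5 * demo$quality + rnorm(nrow(demo), sd = 0.3)
demo$volatile.acidity <- 1.2 - 0.1 * demo$quality + rnorm(nrow(demo), sd = 0.05)

# Median of each variable within each quality level.
medians <- aggregate(cbind(alcohol, volatile.acidity) ~ quality,
                     data = demo, FUN = median)
medians
```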

Conclusion: The plots above suggest that these factors are positively related to the quality of red wine, i.e. the higher the factor, the better the quality.

Let’s check the remaining ones

Conclusion: The plots above suggest that these factors are inversely related to the quality of red wine, i.e. the lower the factor, the better the quality.

Criterion for picking variables for further analysis: a correlation coefficient greater than 0.5 or less than -0.5.
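The selection criterion can be sketched with base R's `cor()`. This swaps in a plain correlation vector for the ggpairs plot used in the analysis, and the data frame `demo` is simulated, so the coefficients it produces are illustrative and not results from the wine data.

```r
# Simulated stand-in data; the relations with quality are built in by design.
set.seed(1)
n <- 300
quality <- sample(3:8, n, replace = TRUE)
demo <- data.frame(
  alcohol          = 9 + 0.6 * quality + rnorm(n, sd = 0.5),    # strongly positive
  volatile.acidity = 1.2 - 0.1 * quality + rnorm(n, sd = 0.1),  # strongly negative
  pH               = rnorm(n, mean = 3.3, sd = 0.15),           # unrelated
  quality          = quality
)

# Pearson correlation of every variable with quality.
r <- cor(demo)[, "quality"]
r <- r[names(r) != "quality"]

# Keep variables whose coefficient is above 0.5 or below -0.5.
selected <- names(r)[abs(r) > 0.5]
selected
```

Using `abs(r) > 0.5` covers both sides of the criterion in one comparison.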

Multivariate Analysis

This suggests that quality decreases sharply as volatile.acidity and citric.acid increase.

We can see that the blue points, i.e. the best-quality samples, lie above the trend line. Let’s verify whether the pattern in the above plot holds across all levels of quality, since many red dots lie below the trend line.

As we can see above, the slope in the initial plot was correct and the relation is positive, i.e. for higher values of citric acid and fixed acidity, the perceived quality of the wine is better.
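The check described above, whether the trend holds within each quality level, can also be done numerically by fitting a separate linear model per category and comparing the slopes. The sketch below is a minimal illustration on simulated data; `demo` and its built-in positive relation are assumptions, not the wine data.

```r
# Simulated stand-in data: 30 samples per quality category, with a positive
# relation between fixed.acidity and citric.acid built in for illustration.
set.seed(3)
n <- 90
demo <- data.frame(
  quality.category = factor(rep(c("low", "medium", "best"), each = n / 3),
                            levels = c("low", "medium", "best")),
  fixed.acidity = runif(n, min = 5, max = 12)
)
demo$citric.acid <- 0.08 * demo$fixed.acidity - 0.2 + rnorm(n, sd = 0.05)

# Fit citric.acid ~ fixed.acidity separately within each quality category
# and extract the slope of each fit.
slopes <- sapply(split(demo, demo$quality.category), function(d) {
  coef(lm(citric.acid ~ fixed.acidity, data = d))[["fixed.acidity"]]
})
slopes
```

If all slopes share the same sign, the overall trend is not just an artifact of pooling the categories together.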

It appears that, as alcohol content increases and density decreases, quality is reduced. However, the ggpairs result tells a completely different story about alcohol: an increase in alcohol seems to affect the quality of the wine positively.

Final plots

Let’s look at some plots that I find interesting

The above plot clearly shows a positive correlation between density and fixed.acidity in relation to the quality of wine.

The above plot suggests that the higher the pH and the lower the fixed.acidity, the better the quality of wine. One might think that pH is a good measure for determining the quality of wine; however, in the ggpairs plot drawn above, its correlation with quality is too weak to be considered. In fact, the correlation is negative.

Note: The pH value lies mostly in the range of 3 to 4.

# Find the percentage of observations in the dataset whose pH lies between 3 and 4
paste(round(mean(wine$pH > 3 & wine$pH < 4) * 100), '%')
## [1] "98 %"

So, out of the 1599 observations in total, approximately 98% have a pH value in the range 3-4.

The ggpairs plot also shows that sulphates has a positive correlation with quality; however, its relation with the other variables is barely visible. Let us plot it against one of the variables it has some correlation with, i.e. alcohol.

The above plot suggests that alcohol and sulphates are among the major contributors to higher wine quality.

Reflection

The dataset we analyzed is fairly small, so this analysis cannot be treated as a definitive way to determine the quality of red wine. Another consideration is that the quality scores were presumably assigned by experts, whose judgments may vary with the geographical region they come from. In short, information about the experts is not available to us, so the results obtained from the plots should not be considered conclusive.

On the analysis front, predictive modeling is another powerful tool that could yield more insights. For that we would need to train algorithms on much larger datasets that also capture information about the experts (geography, age, sex, etc.) along with other factors that may determine quality, such as smell and the appearance of the color (bright, pale, etc.).

EDA is perhaps the best tool for understanding the data and getting a feel for it. It is useful when you want to visualize data before thinking about models and writing code. However, it has limitations, since it depends on the collected data (i.e. the sample).

I would also emphasize the use of a learning algorithm: it would be faster and, if another subset of data is added, it could capture the correlations better than we can by visually inspecting each dependent and independent variable.
